Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions

Authors

Ronald J. Williams

Leemon C. Baird

Abstract

Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be as a function of the maximum norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approximator based on trying to minimize the Bellman residual across states or state-action pairs. When control is based on the use of the resulting value function, this result provides a link between how well the objectives of function approximator training are met and the quality of the resulting control.
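The quantities named in the abstract can be illustrated on a toy problem. The sketch below is not taken from the paper: the two-state, two-action MDP (`P`, `R`), the discount factor `GAMMA`, the imperfect value function `V_imperfect`, and all function names are hypothetical choices made for illustration. It computes the maximum-norm Bellman residual of the given value function, the greedy policy derived from it, and how far that policy's discounted return falls short of optimal. It does not reproduce the paper's bound itself, which the abstract does not state.

```python
"""Toy illustration: Bellman residual vs. greedy-policy loss (hypothetical MDP)."""

GAMMA = 0.9  # discount factor (assumed for this example)

# P[s][a][t]: probability of moving from state s to state t under action a.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
# R[s][a]: expected immediate reward for taking action a in state s.
R = [
    [1.0, 0.0],
    [0.0, 2.0],
]
N_STATES = len(P)
N_ACTIONS = len(P[0])


def lookahead(V, s, a):
    """One-step lookahead for action a in state s, using V for successor states."""
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(N_STATES))


def bellman_residual(V):
    """Maximum over states of |V(s) - best one-step lookahead at s|."""
    return max(
        abs(V[s] - max(lookahead(V, s, a) for a in range(N_ACTIONS)))
        for s in range(N_STATES)
    )


def greedy_policy(V):
    """In each state, pick the action that looks best under V."""
    return [
        max(range(N_ACTIONS), key=lambda a: lookahead(V, s, a))
        for s in range(N_STATES)
    ]


def evaluate_policy(policy, sweeps=2000):
    """Discounted return of a fixed policy, by repeated backup sweeps."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [lookahead(V, s, policy[s]) for s in range(N_STATES)]
    return V


def optimal_values(sweeps=2000):
    """Optimal value function, by value iteration."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [
            max(lookahead(V, s, a) for a in range(N_ACTIONS))
            for s in range(N_STATES)
        ]
    return V


V_imperfect = [12.0, 5.0]  # a deliberately poor value function
residual = bellman_residual(V_imperfect)
policy = greedy_policy(V_imperfect)
V_policy = evaluate_policy(policy)
V_star = optimal_values()
# Worst-case shortfall of the greedy policy relative to optimal.
loss = max(V_star[s] - V_policy[s] for s in range(N_STATES))

print("Bellman residual (max norm):", residual)
print("Greedy policy:", policy)
print("Loss of greedy policy:", loss)
```

Changing `V_imperfect` and rerunning gives a feel for how the residual and the loss move together, which is the relationship the paper's bound makes precise.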


Similar resources

Information Relaxation Bounds for Infinite Horizon Markov Decision Processes

We consider the information relaxation approach for calculating performance bounds for stochastic dynamic programs (DPs), following Brown, Smith, and Sun (2010). This approach generates performance bounds by solving problems with relaxed nonanticipativity constraints and a penalty that punishes violations of these constraints. In this paper, we study infinite horizon DPs with discounted costs a...


On Integral Operator and Argument Estimation of a Novel Subclass of Harmonic Univalent Functions

In this paper we define and verify a subclass of harmonic univalent functions involving the argument of complex-valued functions of the form f = h + ḡ and investigate some properties of this subclass, e.g. necessary and sufficient coefficient bounds, extreme points, distortion bounds and Hadamard product.


Information Relaxations, Duality, and Convex Stochastic Dynamic Programs

We consider the information relaxation approach for calculating performance bounds for stochastic dynamic programs (DPs). This approach generates performance bounds by solving problems with relaxed nonanticipativity constraints and a penalty that punishes violations of these nonanticipativity constraints. In this paper, we study DPs that have a convex structure and consider gradient penalties t...


Nested performance bounds and approximate solutions for the sensor placement problem

This paper considers the placement of m sensors at n > m possible locations. Given noisy observations, knowledge of the state correlation matrix, and a mean-square error criterion (equivalently maximizing an efficacy cost criterion), the problem is formulated as an integer programming problem. Computing the solution for large m and n is infeasible, requiring us to look at approximate algorithms...


Learning Weighted Rule Sets for Forward Search Planning

In many planning domains, it is possible to define and learn good rules for reactively selecting actions. This has led to work on learning rule-based policies as a form of planning control knowledge. However, it is often the case that such learned policies are imperfect, leading to planning failure when they are used for greedy action selection. In this work, we seek to develop a more robust f...




Publication date: 1993
